Add some additional information to customize the knitted document:

date: "September 24, 2020"
output:
  html_document:
    number_sections: yes
    theme: cerulean
    toc: yes
    toc_depth: 5
    toc_float: yes
  pdf_document:
    toc: yes
    toc_depth: '5'

This adds a table of contents (toc) and sets the color theme (theme: cerulean).

To find your favorite Rmarkdown theme: https://www.datadreaming.org/post/r-markdown-theme-gallery/

knitr::opts_chunk$set(cache=TRUE, fig.path='figures/', fig.width=8, fig.height=5 )

This saves all figures in the figures/ directory, sets the default figure size (8 x 5 inches), and caches chunk results (cache=TRUE).

1 R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Rmarkdown Cheatsheet: https://rmarkdown.rstudio.com/lesson-15.html

“#” hash signs indicate headers.

The number of hashes equals the header level.

1.1 h2

1.1.1 h3

1.1.2 h4

placing a single asterisk on either side of a phrase makes it italic.

double asterisks make a word or phrase bold.

triple asterisks make a word or phrase bold and italic.

  • a single asterisk at the beginning of a line makes a bullet
  1. and a number at the beginning of a line creates a numbered item.
  2. this should add

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Execute this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

1.2 Including Plots & Images

You can also embed plots, for example:

(Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.)

echo=FALSE will only display the output, not the code.

Some more chunk options:

  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).

naming chunks = good practice (the above chunk was named pressure)

  • it helps you navigate around the document
  • the chunk name is also what the saved figures will be named

(check the Rproject directory after knitting)

You can also include images from your local computer or from the web:


1.3 Adding tables

Can type out tables:

col name
1 1 1
2 2 2

Alternatively, you can use the knitr package to make markdown tables from data frames:

speed dist
4 2
4 10
7 4
7 22
8 16
9 10

Column alignment can be set to left, right, or center.

1.4 Knitting

When you knit the file, an HTML file containing the code and output will be saved alongside it (click the Knit button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor (Viewer tab).

2 Rprojects

Rproject Benefits:

  • No need to set the working directory. All paths are relative to the directory containing the Rproject.

    Whenever you open your project, the working directory is automatically set to where your project is. This means your code will not break when you work on a different computer.

  • RStudio projects allow you to open multiple projects at the same time with each open to its own project directory. This allows you to keep multiple projects open without them interfering with each other.

Good organization / project layout will:

  • ensure the integrity of your data
  • make it easier to collaborate
  • make it easier to pick a project back up after a break

Project Management tips:

  • treat raw data as “read only”
  • create a separate directory for “cleaned data” (or don’t save altered data files; we will see how later on with dplyr) - results
  • generated output is disposable (because your analysis is reproducible!)
  • put scripts in src directory
  • name all files to reflect their content or function (e.g. fig1_pca_communitycomposition.jpg not Rplot1.jpg)
  • avoid duplication - as code for a project matures, you will want to start splitting out functions into separate scripts. These scripts might be useful across multiple projects. When reusing a script, use a symbolic link to save space on your computer and avoid having to update a file in multiple places. Data that is reused can also be symbolically linked (ln -s)

data for this workshop

following good project management practices, make a new directory called data and download the data we will be playing with in this workshop into that directory:

In terminal tab:

mkdir data

cd data

wget https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv

(if wget is not available, curl -O followed by the same URL also downloads the file)

We will use the data later, but we can get a general sense of the data by looking at it in the terminal, which will help us decide how to load it into R later:

wc -l gapminder_data.csv

head gapminder_data.csv

cd -

3 GitHub

Go to your GitHub account and make a new repository. DO NOT initialize it with a README.

follow the instructions on the next page

(in terminal tab)

echo "# SkillPill_ReproducibleR" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/maggimars/SkillPill_ReproducibleR.git
git push -u origin master

README.md is a markdown file; in many ways it is just like this Rmarkdown file and uses similar syntax.

Try also adding your data directory to your GitHub repository!

Alternatively, you can use the RStudio interface for version control with Git: https://swcarpentry.github.io/git-novice/14-supplemental-rstudio/

(I prefer command line)

4 A few notes on getting help

?function_name

If you can’t quite remember a function name: ??function_name

pro-tip: From within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.

?kable

for special operators use quotes, e.g. ?"<-"

Without any arguments, vignette() will list all vignettes for all installed packages; vignette(package="package-name") will list all available vignettes for package-name, and vignette("vignette-name") will open the specified vignette.

And then there is always Google.

Reproducible and Streamlined Analyses (Day 2)

Exploring the sample data

We already looked at the sample data in Terminal and saw that it was a .csv file with 1705 lines and that it does have a header.

gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)

View data in another tab with View()

when your data is in a github repo - you can also use it directly from the repo:

library(data.table) # you might need to install this package
gapminder <- fread("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", header = TRUE)

To get more information about the data:

dim()

str()

summary()

length()

nrow()

ncol()

names()

head()/tail()
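
As a quick illustration of what these functions report, here is a small sketch using the built-in cars data frame (the same one summarized earlier) rather than gapminder, so it runs without downloading anything. The same calls work on gapminder.

```r
# cars ships with R (datasets package), so no file is needed here
dim(cars)       # number of rows and columns
nrow(cars)      # number of rows
ncol(cars)      # number of columns
length(cars)    # for a data frame this is the number of columns
names(cars)     # column names
str(cars)       # compact overview of column types and first values
head(cars, 3)   # first three rows
tail(cars, 3)   # last three rows
```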

other types we might see ?

str(factor(gapminder$continent))
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
ordering_example <- factor(gapminder$continent, levels= c("Oceania", "Asia", "Europe", "Africa", "Americas"))
str(ordering_example)
##  Factor w/ 5 levels "Oceania","Asia",..: 2 2 2 2 2 2 2 2 2 2 ...

The wrong structure of the data causes lots of problems

str(gapminder)
## Classes 'data.table' and 'data.frame':   1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##  - attr(*, ".internal.selfref")=<externalptr>

as.character() - change factors back into characters

as.numeric() - but need to use as.character() first

can also set stringsAsFactors = FALSE when reading in data
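
The reason as.character() has to come first can be seen in a minimal sketch. The values below are made up for illustration and are not from the gapminder data.

```r
# Hypothetical example values, not taken from the gapminder data
f <- factor(c("10", "20", "20", "5"))

# as.numeric() applied directly to a factor returns the internal level codes,
# not the numbers the labels spell out
codes <- as.numeric(f)

# Converting to character first recovers the original numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

Comparing the two printed vectors shows that only the second one contains the original numbers.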

(gapminder$continent == "Asia")[c(1:100)]
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE

4.1 Subsetting Data Frames

removing rows and columns (subsetting)

by number

gapminder[-1,]
##           country year      pop continent lifeExp gdpPercap
##    1: Afghanistan 1957  9240934      Asia  30.332  820.8530
##    2: Afghanistan 1962 10267083      Asia  31.997  853.1007
##    3: Afghanistan 1967 11537966      Asia  34.020  836.1971
##    4: Afghanistan 1972 13079460      Asia  36.088  739.9811
##    5: Afghanistan 1977 14880372      Asia  38.438  786.1134
##   ---                                                      
## 1699:    Zimbabwe 1987  9216418    Africa  62.351  706.1573
## 1700:    Zimbabwe 1992 10704340    Africa  60.377  693.4208
## 1701:    Zimbabwe 1997 11404948    Africa  46.809  792.4500
## 1702:    Zimbabwe 2002 11926563    Africa  39.989  672.0386
## 1703:    Zimbabwe 2007 12311143    Africa  43.487  469.7093
gapminder[,-1]
##       year      pop continent lifeExp gdpPercap
##    1: 1952  8425333      Asia  28.801  779.4453
##    2: 1957  9240934      Asia  30.332  820.8530
##    3: 1962 10267083      Asia  31.997  853.1007
##    4: 1967 11537966      Asia  34.020  836.1971
##    5: 1972 13079460      Asia  36.088  739.9811
##   ---                                          
## 1700: 1987  9216418    Africa  62.351  706.1573
## 1701: 1992 10704340    Africa  60.377  693.4208
## 1702: 1997 11404948    Africa  46.809  792.4500
## 1703: 2002 11926563    Africa  39.989  672.0386
## 1704: 2007 12311143    Africa  43.487  469.7093

drop multiple rows/columns …

drop or select columns by name

gapminder[, c("year", "pop", "continent")]
##       year      pop continent
##    1: 1952  8425333      Asia
##    2: 1957  9240934      Asia
##    3: 1962 10267083      Asia
##    4: 1967 11537966      Asia
##    5: 1972 13079460      Asia
##   ---                        
## 1700: 1987  9216418    Africa
## 1701: 1992 10704340    Africa
## 1702: 1997 11404948    Africa
## 1703: 2002 11926563    Africa
## 1704: 2007 12311143    Africa
gapminder[ , -c("year", "pop", "continent")]
##           country lifeExp gdpPercap
##    1: Afghanistan  28.801  779.4453
##    2: Afghanistan  30.332  820.8530
##    3: Afghanistan  31.997  853.1007
##    4: Afghanistan  34.020  836.1971
##    5: Afghanistan  36.088  739.9811
##   ---                              
## 1700:    Zimbabwe  62.351  706.1573
## 1701:    Zimbabwe  60.377  693.4208
## 1702:    Zimbabwe  46.809  792.4500
## 1703:    Zimbabwe  39.989  672.0386
## 1704:    Zimbabwe  43.487  469.7093

select rows conditionally

gapminder[country == "Zimbabwe",]
##      country year      pop continent lifeExp gdpPercap
##  1: Zimbabwe 1952  3080907    Africa  48.451  406.8841
##  2: Zimbabwe 1957  3646340    Africa  50.469  518.7643
##  3: Zimbabwe 1962  4277736    Africa  52.358  527.2722
##  4: Zimbabwe 1967  4995432    Africa  53.995  569.7951
##  5: Zimbabwe 1972  5861135    Africa  55.635  799.3622
##  6: Zimbabwe 1977  6642107    Africa  57.674  685.5877
##  7: Zimbabwe 1982  7636524    Africa  60.363  788.8550
##  8: Zimbabwe 1987  9216418    Africa  62.351  706.1573
##  9: Zimbabwe 1992 10704340    Africa  60.377  693.4208
## 10: Zimbabwe 1997 11404948    Africa  46.809  792.4500
## 11: Zimbabwe 2002 11926563    Africa  39.989  672.0386
## 12: Zimbabwe 2007 12311143    Africa  43.487  469.7093
gapminder[country!= "Afghanistan",]
##        country year      pop continent lifeExp gdpPercap
##    1:  Albania 1952  1282697    Europe  55.230 1601.0561
##    2:  Albania 1957  1476505    Europe  59.280 1942.2842
##    3:  Albania 1962  1728137    Europe  64.820 2312.8890
##    4:  Albania 1967  1984060    Europe  66.220 2760.1969
##    5:  Albania 1972  2263554    Europe  67.690 3313.4222
##   ---                                                   
## 1688: Zimbabwe 1987  9216418    Africa  62.351  706.1573
## 1689: Zimbabwe 1992 10704340    Africa  60.377  693.4208
## 1690: Zimbabwe 1997 11404948    Africa  46.809  792.4500
## 1691: Zimbabwe 2002 11926563    Africa  39.989  672.0386
## 1692: Zimbabwe 2007 12311143    Africa  43.487  469.7093
gapminder[lifeExp >= 80,]
##             country year       pop continent lifeExp gdpPercap
##  1:       Australia 2002  19546792   Oceania  80.370  30687.75
##  2:       Australia 2007  20434176   Oceania  81.235  34435.37
##  3:          Canada 2007  33390141  Americas  80.653  36319.24
##  4:          France 2007  61083916    Europe  80.657  30470.02
##  5: Hong Kong China 1997   6495918      Asia  80.000  28377.63
##  6: Hong Kong China 2002   6762476      Asia  81.495  30209.02
##  7: Hong Kong China 2007   6980412      Asia  82.208  39724.98
##  8:         Iceland 2002    288030    Europe  80.500  31163.20
##  9:         Iceland 2007    301931    Europe  81.757  36180.79
## 10:          Israel 2007   6426679      Asia  80.745  25523.28
## 11:           Italy 2002  57926999    Europe  80.240  27968.10
## 12:           Italy 2007  58147733    Europe  80.546  28569.72
## 13:           Japan 1997 125956499      Asia  80.690  28816.58
## 14:           Japan 2002 127065841      Asia  82.000  28604.59
## 15:           Japan 2007 127467972      Asia  82.603  31656.07
## 16:     New Zealand 2007   4115771   Oceania  80.204  25185.01
## 17:          Norway 2007   4627926    Europe  80.196  49357.19
## 18:           Spain 2007  40448191    Europe  80.941  28821.06
## 19:          Sweden 2002   8954175    Europe  80.040  29341.63
## 20:          Sweden 2007   9031088    Europe  80.884  33859.75
## 21:     Switzerland 2002   7361757    Europe  80.620  34480.96
## 22:     Switzerland 2007   7554661    Europe  81.701  37506.42
##             country year       pop continent lifeExp gdpPercap

using & and |
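
A small sketch of combining conditions, assuming a plain data frame. The stand-in data frame df is hypothetical; its values are copied from gapminder rows printed above so the example runs on its own.

```r
# Small stand-in data frame (values copied from the gapminder output above)
df <- data.frame(
  country   = c("Japan", "Zimbabwe", "Canada", "Zimbabwe"),
  year      = c(2007, 2007, 2007, 1952),
  continent = c("Asia", "Africa", "Americas", "Africa"),
  lifeExp   = c(82.603, 43.487, 80.653, 48.451),
  stringsAsFactors = FALSE
)

# & keeps rows where both conditions are TRUE
both <- df[df$year == 2007 & df$lifeExp >= 80, ]

# | keeps rows where at least one condition is TRUE
either <- df[df$continent == "Asia" | df$year == 1952, ]

print(both)
print(either)
```

With gapminder loaded through fread(), the same filter can be written without the df$ prefix, e.g. gapminder[year == 2007 & lifeExp >= 80, ].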

4.2 Control Flow

if, if else, and for

allows us to control when an action is taken

# if
if (condition is true) {
  perform action
}
# if ... else
if (condition is true) {
  perform action
} else {  # that is, if the condition is false,
  perform alternative action
}

examples:

x <- 8
if (x >= 10) {
  print("x is greater than or equal to 10")
}
x
## [1] 8
x <- 8
if (x >= 10) {
  print("x is greater than or equal to 10")
} else {
  print("x is less than 10")
}
## [1] "x is less than 10"
x <- 8
if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5, but less than 10")
} else {
  print("x is less than or equal to 5")
}
## [1] "x is greater than 5, but less than 10"

Challenge:

Use an if() statement to print a suitable message reporting whether there are any records from 2002 in the gapminder dataset:
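
One possible approach, sketched with a short stand-in vector (years is hypothetical; with the real data you would use gapminder$year). The key point is that if() expects a single TRUE or FALSE, so any() is used to collapse the element-wise comparison.

```r
# Stand-in for gapminder$year
years <- c(1952, 1957, 2002, 2007)

# years == 2002 gives one TRUE/FALSE per element; any() reduces it to one value
if (any(years == 2002)) {
  msg <- "Record(s) for the year 2002 found."
} else {
  msg <- "No records for the year 2002."
}
print(msg)
```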

Looping

If you want to iterate over a set of values, when the order of iteration is important, and perform the same operation on each, a for() loop will do the job.

Basic Structure:

for (iterator in set of values) {
  do a thing
}

Example

for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Nested for loop:

for (i in 1:5) {
  for (j in c('a', 'b', 'c', 'd', 'e')) {
    print(paste(i,j))
  }
}
## [1] "1 a"
## [1] "1 b"
## [1] "1 c"
## [1] "1 d"
## [1] "1 e"
## [1] "2 a"
## [1] "2 b"
## [1] "2 c"
## [1] "2 d"
## [1] "2 e"
## [1] "3 a"
## [1] "3 b"
## [1] "3 c"
## [1] "3 d"
## [1] "3 e"
## [1] "4 a"
## [1] "4 b"
## [1] "4 c"
## [1] "4 d"
## [1] "4 e"
## [1] "5 a"
## [1] "5 b"
## [1] "5 c"
## [1] "5 d"
## [1] "5 e"

storing results

output_vector <- c()
for (i in 1:5) {
  for (j in c('a', 'b', 'c', 'd', 'e')) {
    temp_output <- paste(i, j)
    output_vector <- c(output_vector, temp_output)
  }
}

Challenge:

Write a script that loops through the gapminder data by continent and prints out whether the mean life expectancy is smaller or larger than 50 years.
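
A sketch of one way to approach this, assuming a plain data frame with continent and lifeExp columns. The stand-in df below is hypothetical (a few values copied from the output above); swap in gapminder to run it on the full data.

```r
# Stand-in data frame; replace with gapminder for the real exercise
df <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  lifeExp   = c(48.451, 43.487, 55.230, 80.657),
  stringsAsFactors = FALSE
)

threshold <- 50
results <- c()
for (cont in unique(df$continent)) {
  # mean life expectancy over the rows belonging to this continent
  mean_le <- mean(df$lifeExp[df$continent == cont])
  if (mean_le < threshold) {
    line <- paste("Average life expectancy in", cont, "is less than", threshold)
  } else {
    line <- paste("Average life expectancy in", cont,
                  "is greater than or equal to", threshold)
  }
  print(line)
  results <- c(results, line)   # keep the messages as well as printing them
}
```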

4.3 Functions

reusable! (and therefore reproducible!)

Often start by writing a function within an interactive session.

Let's write a function that converts Fahrenheit to Celsius (bc I am moving back to America and I’m going to need this)

fahr_to_celc <- function(temp) {
  celc <- ((temp - 32) * (5 / 9))
  return(celc)
}

get body temp in Celsius: (seems to be important these days)

fahr_to_celc(98.6)
## [1] 37

Stopifnot

fahr_to_celc <- function(temp) {
  stopifnot(is.numeric(temp))
  celc <- ((temp - 32) * (5 / 9))
  return(celc)
}

What happens if you call with a number?

What if you call with a string?
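
To see both cases without stopping the script, the string call can be wrapped in tryCatch(). This is a minimal sketch; the variable names ok and err are just for illustration.

```r
fahr_to_celc <- function(temp) {
  stopifnot(is.numeric(temp))
  celc <- ((temp - 32) * (5 / 9))
  return(celc)
}

# A number passes the check and is converted
ok <- fahr_to_celc(212)

# A string fails the check; tryCatch() captures the error message
# instead of halting execution
err <- tryCatch(
  fahr_to_celc("212"),
  error = function(e) conditionMessage(e)
)

print(ok)
print(err)
```

The error message names the failed condition, which makes it easier to see what kind of input the function expects.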

Combining Functions:

Define two functions

  1. Fahrenheit to Celsius
  2. Celsius to Kelvin

Define a new function that calls both these functions to convert fahrenheit to kelvin
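
One way this could look, as a sketch. The names celc_to_kelvin and fahr_to_kelvin are chosen here for illustration; only fahr_to_celc comes from the notes above.

```r
fahr_to_celc <- function(temp) {
  celc <- (temp - 32) * (5 / 9)
  return(celc)
}

# Kelvin is Celsius shifted by 273.15
celc_to_kelvin <- function(temp) {
  kelvin <- temp + 273.15
  return(kelvin)
}

# Compose the two: the inner call runs first, its result feeds the outer call
fahr_to_kelvin <- function(temp) {
  kelvin <- celc_to_kelvin(fahr_to_celc(temp))
  return(kelvin)
}

# Freezing point of water
freezing_k <- fahr_to_kelvin(32)
print(freezing_k)
```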

A more useful example:

Calculate gross domestic product in our data set

# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap
  return(gdp)
}
calcGDP(head(gapminder))
## [1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231

But that is not super useful. Let's add more arguments so we can extract per country and per year:

# Takes a dataset and multiplies the population column
# with the GDP per capita column, optionally filtering by year and country.
calcGDP <- function(dat, year=NULL, country=NULL) {
  # gapminder was loaded with fread(), so it is a data.table. Inside a
  # data.table's [ ], `year` and `country` resolve to the columns rather than
  # to the function arguments, so every row would pass the filter.
  # Working on a plain data frame avoids that.
  dat <- as.data.frame(dat)
  if (!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country, ]
  }
  gdp <- dat$pop * dat$gdpPercap
  new <- cbind(dat, gdp=gdp)
  return(new)
}

The default arguments are NULL, so nothing is filtered unless year or country is supplied.

head(calcGDP(gapminder, year=2007))
calcGDP(gapminder, country="Australia")

The first call keeps only the 2007 rows (one per country) and the second only the rows for Australia, each with the new gdp column added. Without the as.data.frame() step, a data.table input comes back unfiltered, with all 1704 rows.

Challenge: Test out your GDP function by calculating the GDP for New Zealand in 1987. How does this differ from New Zealand’s GDP in 1952?

moving functions to rscripts and sourcing scripts (best practices for data management!)
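
A minimal sketch of the idea. A temporary file stands in for a script you would normally keep in the project, for example src/functions.R (a hypothetical path, following the src directory tip above).

```r
# Write a function definition to a script file.
# tempfile() is used only so the example is self-contained.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "fahr_to_celc <- function(temp) {",
  "  stopifnot(is.numeric(temp))",
  "  (temp - 32) * (5 / 9)",
  "}"
), script_path)

# source() runs the script, which defines fahr_to_celc in the current session
source(script_path)

fahr_to_celc(32)
```

In an Rproject the call would typically be a relative path such as source("src/functions.R"), which keeps the analysis document short and lets several documents share the same functions.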